---
title: "Lab: </br> Evaluating PROGRESA"
author: "Edward Vytlacil"
date: "June 22, 2022"
toc: true
format:
html:
html-math-method: mathjax
code-tools: true
self-contained: true
execute:
warning: false
reference-location: margin
citation-location: document
bibliography: handouts.bib
---
## Background
Poverty is one of the most important social and economic problems in the world. In order to alleviate it, many solutions were tried throughout the centuries. For example, in 1601, the Parliament of England passed the The Poor Relief Act, distributing money, food and clothing, and providing work and apprenticeships to the poor. In 1964, US Government created the Food Stamp program, providing food-purchasing assistance for low- and no-income people living in the U.S..
Such programs were criticized because they did not stimulate the recipients to invest in human capital nor improved children's educational outcomes. To address those two criticisms, many countries implemented conditional cash transfer programs (CCT), i.e., the government only transfers money to low-income persons who meet some criteria, such as enrolling their children into public schools, going to the doctor regularly and receiving vaccinations. The most famous CCT program was implemented in Mexico in 1997 and was initially called PROGRESA (its current name is Oportunidades).
Scientifically, PROGRESA is very influential because it was combined with a large-scale randomized control trial (RCT). In 1997, the Mexican government and the International Food Policy Research Institute defined which villages were eligible to receive this CCT program and, randomly, picked some villages to receive the transfers starting in 1998 (the treated group) and some villages to only receive the transfers starting in 1999 (the control group). The experimental results were very positive with respect to improving families' social, educational and economic outcomes.
::: column-margin
See @skoufias2001conditional and
@schultz2004school
for more details on PROGRESA and the experimental results.
:::
These positive results led to a rapid expansion of PROGRESA, which covered 2.6 million Mexican households in 50,000 villages by 2001 In addition, the success of PROGRESA stimulated many other countries to create similar programs (Bangladesh, Brazil, Cambodia, Chile, Colombia, Egypt, Guatemala, Honduras, Indonesia, Jamaica, Nicaragua, Panama, Peru, Phillipines, Turkey and the United States).
::: column-margin
See e.g. @attanasio2010children for Columbia.
:::
In this lab, we will look at the effect of PROGRESA on school enrollment using the data from the randomized controlled trial.
Here is a quick summary of the program:
1. **Eligibility for PROGRESA:** Only poor households living in poor, rural villages were eligible.
2. **PROGRESA Treatment:** Cash transfer to the *mother* of each household conditional on regular school attendance between 3rd and 9th grades for children less than 18 years old. The subsidy was slightly higher in 7th to 9th grade than in earlier grades to offset higher opportunity cost for older children, and slightly higher for girls than for boys in 7th through 9th grades to promote gender equality in schooling.
| Education Level of Student | | Monthly Payment |
|----------------------------|-------|-----------------|
| 3rd Year | | 70 |
| 4th Year | | 80 |
| 5th Year | | 105 |
| 6th Year | | 135 |
| 7th Year | Boys | 200 |
| | Girls | 210 |
| 8th Year | Boys | 210 |
| | Girls | 235 |
| 9th Year | Boys | 225 |
| | Girls | 255 |
3. **RCT of PROGRESA:** Approximately two-thirds of 495 rural villages were randomly assigned to be treated villages, and all eligible (sufficiently poor) households in such villages were treated. The remaining villages were randomly assigned to be control villages, and no households in those villages were treated.
quarto-executable-code-5450563D
```mermaid
flowchart LR
A{495 Villages} --> B{314 <br> Treatment<br> Villages}
A --> C{181 <br> Control<br> Villages }
B --> D[Eligible Children <br> Treated]
B --> E(Noneligible Children, <br> Not Treated)
C --> F[Eligible Children <br> Controls]
C --> G(Noneligible Children, <br> Not Treated)
```
4. **Survey Waves:** Five surveys were conducted as part of the RCT. The first two survey waves were conducted in October 1997 and March 1998, and were ``baseline'' surveys, i.e., were conducted before any household was treated. The last three survey waves were conduced in October 1998, May 1999, and November 1999. These waves are post randomization and after the treatment began. Note that the academic year in Mexico goes from late August to early July.
5. **Outcome** The outcome we will focus on is school enrollment, though these surveys contain a wealth
of other information.
## Data
As a first step, download the data set [PROGRESA.csv](https://edward-vytlacil.github.io/Data/PROGRESA.csv).
Alternatively, you can read the data set directly into **R**, as done below.
This data is from @lee2014multiple as posted at the [Journal of Applied Econometrics Data Archive](http://qed.econ.queensu.ca/jae/). This data is only a small part of the full PROGRESA data set,
see @DVN/05BMJY_2018. Also download and read [readme.txt](https://www.dropbox.com/s/ynqfuyliyvxmw8n/readme.txt?dl=0) which describes the data.
## Load Libraries
Start by loading the following libraries. Install them first if you have not already done so.
| Library | Purpose |
|------|---------|
|`stargazer` | package by @hlavac2022stargazer for creating tables. Creates publication-ready tables that are especially well-suited for reporting results in the standard format for academic papers in economics and other social sciences. |
|`ggplot2` | package by @ggplot for creating figures. Is more complicated to use than the graphing options in base **R**, but much more powerful and flexible. |
|`plotly` | package by @plotly for creating interactive figures.|
quarto-executable-code-5450563D
```r
#| label: install-packages
#| eval: false
# install packages if not already installed
install.packages("stargazer")
install.packages("ggplot2")
install.packages("plotly")
```
quarto-executable-code-5450563D
```r
#| label: load-packages
# load packages (make sure they are already installed)
library(stargazer)
library(ggplot2)
library(plotly)
```
::: {.callout-note collapse="true"}
# Instructions for Installing and Loading Libraries
Instruction for installing and loading **R** libraries includes:
- @fastR [Lesson 25](https://github.com/matloff/fasteR#--lesson-25--r-packages-cran-etc)
- @grolemund2014hands [Appendix B]( https://jjallaire.github.io/hopr/a2-packages.html)
:::
::: {.callout-note collapse="true"}
# Other Packages for Creating Tables.
There are other
packages that are more popular for producing tables within data science, and often produce very beautiful tables,
though not typically in a format standard for academic journals in economics.
The most standard way to create tables in **R** is with `kable` from the
`knitr` package, see description [here](https://bookdown.org/yihui/rmarkdown-cookbook/kable.html). We will
not be using the pipe operator `%>%` from the `magrittr` package that is part of `tidyverse`, but, for those
using the pipe operator, the `kableExtra` package by @zhu2019kableextra is very popular, see description
[here](https://bookdown.org/yihui/rmarkdown-cookbook/kableextra.html). These packages as well
as other good options for creating tables are discussed [here](https://rfortherestofus.com/2019/11/how-to-make-beautiful-tables-in-r/).
:::
## Read in the data
- Use `read.csv` to read the data into a `data.frame`.
quarto-executable-code-5450563D
```r
#| output: true
df_ <- read.csv("https://edward-vytlacil.github.io/Data/PROGRESA.csv", header=TRUE, sep=",") # read csv
```
::: {.callout-note collapse="true"}
# Data Frames
To review data frames in **R**, see:
- @fastR [Lesson 5](https://github.com/matloff/fasteR#--lesson-5--on-to-data-frames)
- @grolemund2014hands, [Section 3.8](https://jjallaire.github.io/hopr/objects.html#data-frames);
While we are using data frames, [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) is an enhanced version of data frames.
:::
::: {.callout-note collapse="true"}
# Reading Data into **R**
Instruction for reading data into **R** include:
- @grolemund2014hands, [Appendix D](https://jjallaire.github.io/hopr/a4-data.html);
:::
## Examine the data
- Examine data, make sure we understand variables, data
structure, and look for any issues with the data.
- Explore the data with **R** functions such as `names`, `dim`, `head`, `table` and `summary`.
::: column-margin
You can use the `help` function to see documentation on a function. For example, `help(names)` gives you the
help page for the function `names`.
:::
::: column-margin
Use the codebook [readme.txt](https://www.dropbox.com/s/ynqfuyliyvxmw8n/readme.txt?dl=0) to understand the definition of each variable.
:::
quarto-executable-code-5450563D
```r
#| output: true
names(df_) # name of each column (variable) in data frame df_
dim(df_) # number of rows (observations) and columns (variables) in data frame
head(df_) # first 6 rows (observations) of data frame df_
table(df_$wave) # frequency counts for variable wave
table(df_$progresa1) # frequency counts for variable progresa (treatment status)
table(df_$hgc1) # frequency counts for variable hgc1 (highest grade completed)
summary(df_) # descriptive statistics of all variables in data frame df_
```
Notes:
- The data has `r dim(df_)[1]` child-wave observations, not `r dim(df_)[1]` children, since each child can
appear in the data up to 5 times.
- The data has an uneven number of observations per wave, in other words, this is an *unbalanced* panel.
- <TT>poor==1</TT> for all observations, so that all observations are poor enough to be eligible for PROGRESA. We can
therefore drop the variable <TT>poor</TT>.
- The youngest child is `r min(df_$age1)` years old, while the oldest child is `r max(df_$age1)` years old. With the code below, we find that, while the vast majority of rows have an age between 6 and 18, `r sum(df_$age1<6)` row has a child less than 6 years old, and `r sum(df_$age1>18)` rows report the child being older than 18 years old. These ages are probably
data entry errors. We will later code ages under 6 or over 18 as missing, believing that such values are likely to be data entry errors.
quarto-executable-code-5450563D
```r
#| output: false
sum(df_$age1<6)
# 1
sum(df_$age1>=6 & df_$age1<=18)
# 73769
sum(df_$age1>18)
# 261
sum(df_$age1==99)
# 10
```
::: {.callout-note collapse="true"}
# Dummy Variables
A *dummy variable* is a variable that always equals $0$ or $1$. They are also called
*logical indicator variables.* We typically use them to represent TRUE/FALSE, YES/NO
type data -- for example, you are over age $18$ or you are not.
The
mean of a dummy variable is the fraction of observations for which the variable equals
$1$. Likewise, the sum of a dummy variable is the number of observations for which the variable equals
$1$.
The following code defines a new dummy variable for being over age 18, and computes the
number of observations over age 18 and the fraction of observations over age 18.
quarto-executable-code-5450563D
```r
#| eval: false
df_$Dummy1 <- df_$age1>18 #Create dummy variable for age1>18
sum(df_$Dummy1) # number of observations with age1>18
mean(df_$Dummy1) # fraction of observations age1>18
```
The following code is equivalent, without the step of defining the dummy variable:
quarto-executable-code-5450563D
```r
#| eval: false
sum(df_$age1>18) # number of observations with age1>18
mean(df_$age1>18) # fraction of observations with age1>18
```
:::
::: {.callout-note collapse="true"}
# Further investigating ages
We could further investigate the observations
with unusual ages. For example, we could inspect
the one row with a reported age of $0$, and
see that the 0-year old child has already completed
first grade, making it unlikely that the reported
age of zero is correct:
quarto-executable-code-5450563D
```r
df_[df_$age1==0,]
```
The observation's child id is `r df_[df_$age1==0,]$sooind_id` and we could inspect
the other waves for the same child.
quarto-executable-code-5450563D
```r
df_[df_$sooind_id==1586,]
```
We see that the same child was coded as having the following ages:
- Wave 1, age `r df_[df_$sooind_id==1586 & df_$wave==1 ,]$age1`;
- Wave 3, age `r df_[df_$sooind_id==1586 & df_$wave==3 ,]$age1`;
- Wave 4, age `r df_[df_$sooind_id==1586 & df_$wave==4 ,]$age1`;
- Wave 5, age `r df_[df_$sooind_id==1586 & df_$wave==5 ,]$age1`
Clearly, the child age couldn't jump from age 7 to
age 99 between waves 1 and 3, and couldn't go *down* in age from 99 years old to 0 years old from
wave 3 to wave 4. Clearly the ages of 0 and 99
are coded incorrectly. We could guess that the
child is actually 7 years old in all the waves
(*imputing* the ages in waves 3 and 4 based on the
ages in waves 1 and 5), but here we will just later code the 0 and 99 as missing.
:::
## Number of Observations
The number of observations plays a critical role in statistical analysis, including
our choice of research methodology. The more (independent) observations:
- the more precise the estimates;
- the better the approximations coming from the central limit theorem and other asymptotic results;
- the higher the power of tests to reject false hypotheses, and the narrower our confidence intervals;
- the more flexible are the models that we can feasibly fit to the data.
We have `r nrow(df_)` rows in our data frame. It is a panel data set, where each row is a child-wave observations, with
each child appearing in the data frame up to $5$ times. We shouldn't think of number of child-wave observations as
number of independent observations. For example, a given child being a boy
in wave one is not independent of the same child being a boy in wave two. We might think that independence across children is reasonable, or at least
a reasonable approximation, and thus might consider our number of observations
to be the number of children when thinking about statistical precision or
accuracy of asymptotic approximations. On the other hand,
the experiment assigned treatment at the village level, and we might think children within a village (certainly within the same family) are
not independent. We might therefore think that we should think of the number of villages as the number of observations when considering statistical precision and asymptotic approximations.
- Recall
- Child id is <TT>sooind_id</TT>.
- Village id is <TT>sooloca</TT>.
- Using `nrow`, we find that the data has
- `r nrow(df_)` rows (child-wave observations)
- Using `length` and `unique`, we find that the data has
- `r length(unique(df_$sooind_id))` child-observations in the data, of which
`r length(unique(subset(df_,df_$progresa1==1)$sooind_id))` are treated and
`r length(unique(subset(df_,df_$progresa1==0)$sooind_id))` are control.
- `r length(unique(df_$sooloca))` villages, of which `r length(unique(subset(df_,df_$progresa1==1)$sooloca))` are treated and `r length(unique(subset(df_,df_$progresa1==0)$sooloca))` are control.
quarto-executable-code-5450563D
```r
#| eval: false
nrow(df_) # Number of rows (child-wave observations)
# 74031
length(unique(df_$sooind_id)) #Number of children
# 15669
length(unique(subset(df_,df_$progresa1==1)$sooind_id)) #Number of treated children
# 9799
length(unique(subset(df_,df_$progresa1==0)$sooind_id)) #Number of control children
# 5870
length(unique(df_$sooloca)) #Number of villages
# 491
length(unique(subset(df_,df_$progresa1==1)$sooloca)) #Number of treated villages
# 308
length(unique(subset(df_,df_$progresa1==0)$sooloca)) #Number of control villages
# 183
```
::: {.callout-note collapse="true"}
# Length and Unique Functions
- `unique` is a function whose argument is a vector (or data.frame) and which returns the unique values of that vector. For example:
quarto-executable-code-5450563D
```r
#| eval: false
a<- c(1,1,2,2,3,4,5)
unique(a)
# 1 2 3 4 5
b<- c("Ed","Joe","Ed","Lisa","Joe","Karen")
unique(b)
# "Ed" "Joe" "Lisa" "Karen"
```
- `length` is a function whose argument is a vector and which returns the length of that vector. For example:
quarto-executable-code-5450563D
```r
#| eval: false
length(a)
# 7
length(b)
# 6
a.0 <- unique(a)
b.0 <- unique(b)
length(a.0)
# 5
length(b.0)
# 4
```
:::
::: {.callout-note collapse="true"}
# Nested Functions
In **R** one can nest functions. For example: `length(unique(a))` is a nested function,
and should be interpreted from the inner most function (inner most parentheses) outward.
First the function `unique` is applied the vector <TT>a</TT>, finding the unique elements
of <TT>a</TT>, and then `length` finds the length of that vector and thus the number of unique
elements of <TT>a</TT>. The following code is equivalent:
quarto-executable-code-5450563D
```r
#| eval: false
length(unique(c(1,2,1,2,3,4,4,5)))
# 5
a<- c(1,2,1,2,3,4,4,5)
a.0 <- unique(a)
length(a.0)
# 5
```
Likewise, the code `length(unique(subset(df_,df_$progresa1==0)$sooind_id))` has nested functions, and should be interpreted from the inner most function (inner most parentheses) outward. In particular:
- `subset(df_,df_$progresa1==0)` is taking the subset of the data frame <TT>df_</TT> that
has the variable ``progresa1==0`` (untreated observations). `subset(df_,df_$progresa1==0)$sooind_id`
is taking the column (variable) <TT>sooind_id</TT> (child id) from the resulting data frame.
- `unique(subset(df_,df_$progresa1==0)$sooind_id)` is taking the unique elements of the vector `subset(df_,df_$progresa1==0)$sooind_id`, in other words, the unique child id numbers from the
data frame of control observations.
- `length(unique(subset(df_,df_$progresa1==0)$sooind_id))` is taking the length of the vector `unique(subset(df_,df_$progresa1==0)$sooind_id)`, in other words, how many unique child id numbers are there
for control children.
We can use nested functions, which generally results in more concise code, though the code can potentially
be confusing. The following code
is equivalent:
quarto-executable-code-5450563D
```r
#| eval: false
length(unique(subset(df_,df_$progresa1==0)$sooind_id))
#5870
c.0 <- subset(df_,df_$progresa1==0)$sooind_id
c.1 <- unique(c.0)
length(c.1)
#5870
```
:::
## Preparing data for analysis
Before we actually start analyzing the data, we will prepare it for analysis. In particular, we will complete the folowing steps:
- Change name of variables for convenience:
- <TT>progresa1</TT> to <TT>treat</TT>,
- <TT>sex1</TT> to <TT>girl</TT> .
- Using the `subset` command, estrict data to first wave (first
baseline wave) and fourth wave (second post-treatment wave) and keep
only relevant variables.
- Using the `ifelse` command, code as missing any age less than 6
or greater than 18.
- Following @schultz2004school create balanced sample, and additionally restrict to children aged 6 to 16 in first wave.
- Using the `ifelse` command, create an indicator variable for whether an observation is from the post period, i.e., a variable that equals 1 if the observation is from waves 5.
- Create seperate data frames for pre and post treatment observations.
quarto-executable-code-5450563D
```r
df_$treat <- df_$progresa1
df_$girl <- df_$sex1
df<-subset(df_, wave==4 | wave==1,
select= c(sooloca,sooind_id,age1,hgc1,school,treat,girl,wave))
rm(df_)
df$age1 <- ifelse(df$age1>=6 & df$age1<=18, df$age1, NA) # setting implausible ages to missing
balanced_ids <- intersect(df[df$wave==4,"sooind_id"],
df[df$wave==1 & df$age1>=6 & df$age1<=16,"sooind_id"])
df <- df[df$sooind_id %in% balanced_ids,]
df$post <- ifelse(df$wave==4, 1, 0) # creating dummy variable for post period
dfPre <- df[df$post==0,] # creating new data.frame with only pre treatment observations
dfPost <- df[df$post==1,] # creating new data.frame with only post treatment observations
```
::: {.callout-note collapse="true"}
# Data Wrangling
- This data is already fairly "clean". However, data often
needs more extensive preparation for analysis, sometimes called "data wrangling".
- For methods to help with more involved "data wrangling" in **R**, see either:
- [R Workflow](http://hbiostat.org/rflow/) by Frank Harrell, approach based on `HMISC`.
- [R for Data Science](https://r4ds.had.co.nz) by Hadley Wickham and Garret Grolemund, approach based on `tidyverse`.
:::
## Balance at Baseline: Summary Table
If the
randomization is done correctly, then any differences at baseline (pre-treatment)
between those
assigned to treatment vs control is due purely to chance.
However, by pure
chance, random assignment can result in differences between those
assigned to treatment vs control.
For this
reason, it is common in experimental papers to produce a table examining
whether, on average, before receipt of treatment,
those assigned to treatment have similar values of the covariates
as those assigned to control. The tables include mean baseline covariates, separately
for those assigned to treatment and those assigned to control. The tables often also include a
standardized difference, the difference between the two groups
dividing by the pooled estimated standard deviation.
The tables
often includes t-tests of the null
of equal means, though such hypothesis testing in this context is hard to justify.
[see @bruhn2009pursuit Section G].
Researchers sometimes re-randomize treatment assignment if the baseline is not sufficiently balanced.
Doing so does typically improve precision, similar to proper
linear regression adjustment for baseline characteristics [@li2018asymptotic; @lin2013agnostic; and @negi2021revisiting]. However,
doing so invalidates conventional standard errors and inference
procedures, and is inferior to other methods to design the experiment or
to adjust for differences at baseline.
See, e.g., @bai2022optimality.
:::: {.columns}
::: {.column width="38%"}
Create a table reporting the mean
of baseline characteristics separately by treatment assignment, the difference
in those means, and the standardized difference (dividing the difference by
the pooled estimated standard deviation).
:::
::: {.column width="4%"}
:::
::: {.column width="58%"}
```{r balance1}
#| output: asis
#| echo: false
return_mean_by_treatment <- function(x){
means.t<-tapply(x,dfPre$treat,mean)
var.t<-tapply(x,dfPre$treat,var)
return(c(means.t,var.t))
}
vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)
means <- output[1:2,]
diffs <- output[2,]-output[1,]
N1 <- sum(dfPre$treat)
N0 <- sum(1-dfPre$treat)
pooled.sd <- sqrt(((N0-1)*output[3,]+(N1-1)*output[4,])/(N0+N1-2))
std.diffs <- (output[2,]-output[1,])/pooled.sd
results0 <- t(rbind(means,diffs,std.diffs))
colnames(results0)<-c("Control","Treated","Diff.","Std Diff")
varlabels <- c("Girl","Age","Highest Grade","Enrolled")
rownames(results0)<-c(varlabels)
stargazer(results0, type="html", digits=2, title="Balance at Baseline")
```
:::
::::
```{r balance1}
#| eval: false
#| attr.source: ".numberLines"
```
::: {.callout-note collapse="true"}
# `tapply` and `sapply`
`tapply` allows you to apply a function separately by groups. The general form
is `tapply(X,INDEX,FUN)` which applies
the function `FUN` to the vector `X` separately
by levels of `INDEX`.
For example,
`tapply(dfPre$age1,dfPre$treat,mean)` applies
the function `mean` to the vector `dfPre$age1`
by levels of `dfPre$treat` (treated vs
control). Using that code, we find that
the mean age of controls is `r round(tapply(dfPre$age1,dfPre$treat,mean)[1],2)` and the mean age of treated is
`r round(tapply(dfPre$age1,dfPre$treat,mean)[2],2)`
::: column-margin
For more on `tapply` and `sapply`, see @fastR [Lesson 8](https://github.com/matloff/fasteR#--lesson-9--the-tapply-function) and [Lesson 28](https://github.com/matloff/fasteR#--lesson-28--should-you-use-functional-programming).
:::
quarto-executable-code-5450563D
```r
tapply(dfPre$age1,dfPre$treat,mean) # mean age by treatment status
```
`sapply` allows you to apply a function to each element of a list, vector or data.frame.
The general form
is `sapply(X,FUN)` which applies
the function `FUN` to each element of `X`.
For example, `sapply(dfPre,mean)` applies
the function `mean` to each column (each variable) in the data frame <TT>dfPre</TT>. In the following
code, we first define a list of variables <TT>vars</TT>, so that
`dfPre[vars]` is a data frame only containing the variables in
<TT>vars</TT>, and then `sapply(dfPre[vars],mean)` is applying the function
`mean` to each of those variables.
quarto-executable-code-5450563D
```r
vars <- c("girl","age1","hgc1","school")
sapply(dfPre[vars],mean) # mean of each column in dfPre[vars]
```
Using `tapply` and `sapply` can help you create efficient code and save
you time.
::: aside
For more on user defined functions see @grolemund2014hands Sections [1.4](https://jjallaire.github.io/hopr/basics.html#sec-write-functions)
and [1.5](https://jjallaire.github.io/hopr/basics.html#arguments), or
@fastR Lessons [16](https://github.com/matloff/fasteR#--lesson-16--writing-your-own-functions) and [18](https://github.com/matloff/fasteR#--lesson-18--functions-with-blocks).
:::
:::
::: {.callout-note collapse="true"}
# User Defined Functions
You can write your own functions in **R**. Objects of class "function" in **R** have the form:
quarto-executable-code-5450563D
```r
#| echo: true
#| eval: false
# defining function
function.name <- function(<arguments>){
# Function body, code taking arguments as inputs
}
# calling function
function.name(<arguments>)
```
The last value computed in the function body is the value returned by the function. For example, the following function adds one to its argument:
quarto-executable-code-5450563D
```r
#| echo: true
#| eval: false
f.addone <- function(x){
x+1
}
f.addone(1)
# 2
f.addone(10)
# 11
```
Defining functions can help make our code more efficient.
Consider:
quarto-executable-code-5450563D
```r
#| echo: true
#| eval: true
return_mean_by_treatment <- function(x){
means.t<-tapply(x,dfPre$treat,mean)
var.t<-tapply(x,dfPre$treat,var)
return(c(means.t,var.t))
}
vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)
```
The above code is defining a function named `return_mean_by_treatment`. It will take a vector as it's argument. It performs the following steps:
1. calculates the mean of the vector separately by treatment status;
2. calculates the variance of vector separately by treatment status;
3. returns a means and vectors calculated in steps (1) and (2).
Using `sapply`, the code then applies that function to each column in `dfPre[vars]`.
::: column-margin
For more on user defined functions see @grolemund2014hands Sections [1.4](https://jjallaire.github.io/hopr/basics.html#sec-write-functions)
and [1.5](https://jjallaire.github.io/hopr/basics.html#arguments), or
@fastR Lessons [16](https://github.com/matloff/fasteR#--lesson-16--writing-your-own-functions) and [18](https://github.com/matloff/fasteR#--lesson-18--functions-with-blocks).
:::
:::
## Balance at Baseline: School Enrollment at Each Grade
Consider balance in school enrollment at each grade level. In particular,
create a figure showing fraction enrolled in school at each grade
by treatment assignment. Create the figure separately for boys and girls.
::: column-margin
To learn more about `ggplot`, see [http://r-statistics.co/ggplot2-Tutorial-With-R.html](http://r-statistics.co/ggplot2-Tutorial-With-R.html).
:::
```{r baselineenroll}
MeanSchC.s <- with(subset(dfPre, treat==0),tapply(school, list(hgc1,girl), mean))
MeanSchT.s <- with(subset(dfPre, treat==1),tapply(school, list(hgc1,girl), mean))
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
MeanSch.b <- matrix(c(MeanSchC.s[,1],MeanSchT.s[,1]), nrow = 20, ncol = 1)
MeanSch.g <- matrix(c(MeanSchC.s[,2],MeanSchT.s[,2]), nrow = 20, ncol = 1)
tab.b <- data.frame(Grade, Group, MeanSch.b)
tab.g <- data.frame(Grade, Group, MeanSch.g)
plotPre.b <- ggplot(tab.b, aes(x = Grade, y = MeanSch.b, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
xlab("Grade Level") +
ylab("Mean School Enrollment")+
ggtitle("Boys: School Enrollment by Grade, Pre Period")
plotPre.g <- ggplot(tab.g, aes(x = Grade, y = MeanSch.g, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
xlab("Grade Level") +
ylab("Mean School Enrollment")+
ggtitle("Girls: School Enrollment by Grade, Pre Period")
```
quarto-executable-code-5450563D
```r
#| eval: false
ggplotly(plotPre.b,tooltip="y")
ggplotly(plotPre.g,tooltip="y")
```
::: {.panel-tabset}
## Enrollment Boys
```{r baselineenroll1}
#| echo: false
#| column: page-right
ggplotly(plotPre.b,tooltip="y")
```
## Enrollment Girls
```{r baselineenroll2}
#| echo: false
#| column: page-right
ggplotly(plotPre.g,tooltip="y")
```
:::
## Using Post Data to Estimate Effect
Now, use mean differences on post-treatment data
to estimate the effect of PROGRESA on school enrollment.
By random assignment, the mean difference estimator does
not suffer from selection bias. Furthermore, we can take mean difference conditional on any covariate that is not effected by the treatment,
including any baseline characteristic, to estimate a conditional
average treatment effect.
- Overall: `r round(mean(dfPost[ dfPost$treat==1,"school"]) - mean(dfPost[ dfPost$treat==0,"school"]),digits=3)`,
- By sex:
- For boys: `r round(mean(dfPost[ dfPost$treat==1 & dfPost$girl==0,"school"]) - mean(dfPost[ dfPost$treat==0 & dfPost$girl==0,"school"]),digits=3)`,
- For girls: `r round(mean(dfPost[ dfPost$treat==1 & dfPost$girl==1,"school"]) - mean(dfPost[ dfPost$treat==0 & dfPost$girl==1,"school"]),digits=3)`.
quarto-executable-code-5450563D
```r
#| eval: false
mean(dfPost[ dfPost$treat==1,"school"]) - mean(dfPost[ dfPost$treat==0,"school"])
# 0.04063189
mean(dfPost[dfPost$treat==1&dfPost$girl==0,"school"])-mean(dfPost[dfPost$treat==0&dfPost$girl==0,"school"])
# 0.02215276
mean(dfPost[dfPost$treat==1&dfPost$girl==1,"school"])- mean(dfPost[dfPost$treat==0&dfPost$girl==1,"school"])
# 0.05975947
```
## Using Post Data to Estimate Effect by Grade Level
Now, create a figure examining the effect of PROGRESA on
enrollment by grade level, separately for boys and girls.
quarto-executable-code-5450563D
```r
#| code-fold: true
#| column: page-right
MeansC <- with(dfPost,tapply(school,list(treat,girl,hgc1),mean))
dimnames(MeansC)[[1]] <-c("Control","Treated")
dimnames(MeansC)[[2]] <-c("Boys","Girls")
EffSchBoy <- (MeansC[2,1,] - MeansC[1,1,])
EffSchGirl <- (MeansC[2,2,] - MeansC[1,2,])
Grade <- as.factor(c(1:10, 1:10))
EffSch <- matrix(c(EffSchBoy, EffSchGirl), nrow = 20, ncol = 1)
sex <- c(rep("Boy", 10), rep("Girl", 10))
tabEff <- data.frame(Grade, sex, EffSch)
plotEff <- ggplot(tabEff, aes(x = Grade, y = EffSch, fill = sex)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = -0.1, to = 0.1, by = 0.05)) +
xlab("Grade Level") +
ylab("Mean School Enrollment") +
ggtitle(" Treatment Effects Estimates by Grade and Sex")
ggplotly(plotEff,tooltip="y")
```
Some results are surprising, e.g. the estimated effect on boys in first grade is `r round(EffSchBoy[1],3)`. Presumably, some of these results are driven by random sampling noise. The level of random sampling noise depends on the number of observations (and the level of dependence/independence across observations).
With i.i.d. data,$$\mbox{Var}(\bar{X}_N)=\frac{\sigma^2}{N},$$ where $\sigma^2 = \mbox{Var}(X_i)$. Estimating a mean difference on a small subgroup will
have more sampling noise than an estimate on the full sample -- we expect
better precision for the estimate for the full sample than in an estimate for
girls entering grade $2$. In addition, estimating results separately by
many subgroups can give rise to spurious results due to data mining. We will
return to these issues.
## No. of Obs. by Grade-Sex-Treatment
Examine:
1. The number of boys in each grade by treatment status;
2. The number of girls in each grade by treatment status;
3. The number of villages with boys in each each grade by treatment status;
4. The number of villages with boys in each each grade by treatment status;
These counts should give us some idea of how much sampling noise to expect for each estimated treatment effect by sex and grade.
quarto-executable-code-5450563D
```r
#| code-fold: true
person.counts <- with(dfPost,tapply(sooind_id,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(person.counts)[[1]] <-c("Control","Treated")
dimnames(person.counts)[[2]] <-c("Boys","Girls")
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
numbers.boys <- matrix(c(person.counts[1,1,],person.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls<- matrix(c(person.counts[1,2,],person.counts[2,2,]), nrow = 20, ncol = 1)
tab.boys <- data.frame(Grade, Group, numbers.boys)
tab.girls <- data.frame(Grade, Group, numbers.girls)
plot.n.boys <- ggplot(tab.boys, aes(x = Grade, y = numbers.boys, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Boy Observations")+
ggtitle("Number of Boy Observations, Treated and Control")
plot.n.girls <- ggplot(tab.girls, aes(x = Grade, y = numbers.girls, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Girl Observations")+
ggtitle("Number of Girl Observations, Treated and Control")
village.counts <- with(dfPost,tapply(sooloca,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(village.counts)[[1]] <-c("Control","Treated")
dimnames(village.counts)[[2]] <-c("Boys","Girls")
numbers.boys.v <- matrix(c(village.counts[1,1,],village.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls.v<- matrix(c(village.counts[1,2,],village.counts[2,2,]), nrow = 20, ncol = 1)
tab.boys.v <- data.frame(Grade, Group, numbers.boys.v)
tab.girls.v <- data.frame(Grade, Group, numbers.girls.v)
plot.n.boys.v <- ggplot(tab.boys.v, aes(x = Grade, y = numbers.boys.v, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Villages with Boy Observations")+
ggtitle("Number of Villages with Boy Observations")
plot.n.girls.v <- ggplot(tab.girls.v, aes(x = Grade, y = numbers.girls.v, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Villages with Girl Observations")+
ggtitle("Number of Villages with Girl Observations")
```
::: {.panel-tabset}
## Number Boys
quarto-executable-code-5450563D
```r
#| echo: false
#| column: page-right
ggplotly(plot.n.boys,tooltip="y")
```
## Number Girls
quarto-executable-code-5450563D
```r
#| column: page-right
#| echo: false
ggplotly(plot.n.girls,tooltip="y")
```
## Number Villages for Boys
quarto-executable-code-5450563D
```r
#| column: page-right
#| echo: false
ggplotly(plot.n.boys.v,tooltip="y")
```
## Number Villages for Girls
quarto-executable-code-5450563D
```r
#| column: page-right
#| echo: false
ggplotly(plot.n.girls.v,tooltip="y")
```
:::
## Conclusion
Overall, we find positive estimated effects of PROGRESA on school
enrollment, and estimate that the effect is larger for girls than for boys. The estimated effects vary widely by sex-grade cells, with the results the
most extreme and unexpected where we have the fewest number of observations and
thus expect the most sampling noise.
The next natural step is to conduct formal inference, for example, to test the
null hypothesis of no treatment effect including for subgroups. We
will return to hypothesis testing next time.
### References
::: {#refs}
:::